Use this dataset of airline arrival information to predict how late flights will be. A flight only counts as late if it is more than 30 minutes late.
In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import math
from matplotlib.mlab import PCA as mlabPCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, KFold, cross_val_predict
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA as sklearn_pca
from sklearn import decomposition
from sklearn import ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import RFE
In [3]:
# Read and import data
airlines = pd.read_csv('Airlines 2008.csv')
airlines.head()
Out[3]:
In [4]:
#Change the ArrDelay variable into a categorical variable
airlines.loc[airlines['ArrDelay'] <= 30, 'ArrDelay'] = 0
airlines.loc[airlines['ArrDelay'] > 30, 'ArrDelay'] = 1
In [5]:
#Check length of the dataset
len(airlines)
Out[5]:
In [6]:
#Type of info per column of the dataset
airlines.info()
In [7]:
#Substitute infinity and Nan values by Zeros and check that no Nans are left
airlines = airlines.replace([np.inf, -np.inf], np.nan)
airlines = airlines.fillna(0)
print(airlines.isnull().sum())
print(airlines.info())
In [8]:
#Drop "object" type features as they are not going to add information to the model
airlines = airlines.drop(airlines[['UniqueCarrier', 'TailNum','Origin','Dest','CancellationCode']],axis=1)
In [9]:
# Compute average number of delayed flights per hour
airlines['hour'] = airlines['CRSArrTime'].map(lambda x: int(str(int(x)).zfill(4)[:2]))
grouped = airlines[['ArrDelay', 'hour' ]].groupby('hour').mean()
# plot average delays by hour
grouped.plot(kind='bar')
Out[9]:
In [10]:
# Compute average number of delayed flights per day of the month
grouped = airlines[['ArrDelay', 'DayOfWeek']].groupby('DayOfWeek').mean()
# plot average delays by day of the month
grouped.plot(kind='bar')
Out[10]:
In [11]:
# Compute average number of delayed flights per day
grouped = airlines[['ArrDelay', 'DayofMonth']].groupby('DayofMonth').mean()
# plot average delays by day
grouped.plot(kind='bar')
Out[11]:
In [12]:
# Compute average number of delayed flights per month
grouped = airlines[['ArrDelay', 'Month']].groupby('Month').mean()
# plot average delays by month
grouped.plot(kind='bar')
Out[12]:
In [13]:
XA= airlines.drop('ArrDelay',axis=1)
yA = airlines['ArrDelay']
#Preprocess and scale data
names = XA.columns
XA_scaled = pd.DataFrame(preprocessing.scale(XA), columns = names)
summary = XA_scaled.describe()
XA_scaled.head()
Out[13]:
In [14]:
#Calculate Feature Importance using Random Forest
rf = RandomForestClassifier()
rf.fit(XA_scaled, yA)
#Define feature importance
feature_importance8 = rf.feature_importances_
# Make importances relative to max importance.
feature_importance8 = 100.0 * (feature_importance8 / feature_importance8.max())
sorted_idx = np.argsort(feature_importance8)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(7, 30))
plt.subplot(1, 1, 1)
plt.barh(pos, feature_importance8[sorted_idx], align='center')
plt.yticks(pos, XA_scaled.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('ArrDelay')
plt.show()
In [15]:
#Feature Selection. Scores for the most relevant features (should we start with the one that has more explanatory power)
from sklearn.feature_selection import SelectKBest
# feature extraction
test = SelectKBest()
fit = test.fit(XA_scaled, yA)
#Identify features with highest score from a predictive perspective (for all programs)
names2 = XA_scaled.columns
Bestfeatures = pd.DataFrame(fit.scores_, index = names2)
Bestfeatures.columns = ['Best Features']
Bestfeatures.sort_values(by=['Best Features'], ascending=False)
Out[15]:
In [29]:
#From Feature selection, we´ll only keep the most meaningful features
fs = ['DepDelay','NASDelay','LateAircraftDelay','CarrierDelay','TaxiOut','WeatherDelay']
#From Recursive Feature Elimination - considering from Feature selection WeatherDelay only DepTime has been included in the list
rfe = ['DepDelay', 'LateAircraftDelay', 'NASDelay', 'CarrierDelay', 'TaxiOut', 'DepTime', 'WeatherDelay']
#From Features Selection and PCA analysis (only three features are meaningful+ArrDelay)
fps = ['ArrDelay','NASDelay','LateAircraftDelay','CarrierDelay','TaxiOut']
#Set up the new dataset based on the meaningful features
airlines2=airlines[fps]
In [30]:
#Check the new dataset
airlines2.head()
Out[30]:
In [31]:
#Count number of datapoints for ArrDelay
airlines2['ArrDelay'].value_counts()
Out[31]:
In [32]:
#Downsample majority class (due to computational restrictions we downsample the majority instead of upsampling the minority)
# Separate majority and minority classes
airlines_majority = airlines2[airlines2.ArrDelay==0]
airlines_minority = airlines2[airlines2.ArrDelay==1]
# Downsample airlines majority
airlines_majority_downsampled = resample(airlines_majority, replace=False, n_samples=901398, random_state=123)
# Combine minority class with downsampled majority class
airlines_downsampled = pd.concat([airlines_majority_downsampled, airlines_minority])
# Display new class counts
airlines_downsampled.ArrDelay.value_counts()
Out[32]:
In [33]:
#Preprocess and scale data to fit into the models
airlines_downsampled2=airlines_downsampled.drop('ArrDelay',axis=1)
names = airlines_downsampled2.columns
X2 = pd.DataFrame(preprocessing.scale(airlines_downsampled2), columns = names)
X2.describe()
Out[33]:
In [34]:
#Define Outcome & Predictors
y = airlines_downsampled['ArrDelay']
X2 = airlines_downsampled2
#Split into test and train sets
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.3, random_state=42)
#KFold for cross validation analysis
kf = KFold(5)
PCA Analysis
In [35]:
# Build up the correlation mtrix
Z = X2
correlation_matrix = Z.corr()
In [36]:
#Eigenvectores & Eigenvalues
eig_vals, eig_vecs = np.linalg.eig(correlation_matrix)
# Inspecting the eigenvalues and eigenvectors.
for i in range(len(eig_vals)):
eigvecs = eig_vecs[:, i].reshape(1, len(Z.columns)).T
print('Eigenvector {}: \n{}'.format(i + 1, eigvecs))
print('Eigenvalue {}: {}'.format(i + 1, eig_vals[i]))
print(40 * '-')
sklearn_pca = PCA(n_components=len(Z.columns))
Y_sklearn = sklearn_pca.fit_transform(correlation_matrix)
print(
'The percentage of total variance in the dataset explained by each',
'component from Sklearn PCA.\n',
sklearn_pca.explained_variance_ratio_
)
In [41]:
#PCA Analysis
# Create a scaler object
sc = StandardScaler()
# Fit the scaler to the features and transform
X_std = sc.fit_transform(X2)
# Create a PCA object
pca = decomposition.PCA(n_components=3)
# Fit the PCA and transform the data
X_std_pca = pca.fit_transform(X_std)
# View the new feature data's shape
X_std_pca.shape
# Create a new dataframe with the new features
XPCA = pd.DataFrame(X_std_pca)
XPCA.head()
Out[41]:
Logistic Regression
In [42]:
# Initialize and fit the model.
lr = LogisticRegression()
fittrain = lr.fit(X_train,y_train)
fittest = lr.fit(X_test,y_test)
# Predict on training set
predtrain_y = lr.predict(X_train)
predtest_y = lr.predict(X_test)
In [43]:
from sklearn.metrics import classification_report
#Training Scores
target_names = ['0.0', '1.0']
print(classification_report(y_train, predtrain_y, target_names=target_names))
cnf = confusion_matrix(y_train, predtrain_y)
print(cnf)
# Accuracy tables.
table_train = pd.crosstab(y_train, predtrain_y, margins=True)
train_tI_errors = table_train.loc[0.0,1.0] / table_train.loc['All','All']
train_tII_errors = table_train.loc[1.0,0.0] / table_train.loc['All','All']
print((
'Training set accuracy:\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}\n\n'
).format(train_tI_errors, train_tII_errors))
In [44]:
#Testing Scores
target_names = ['0.0', '1.0']
print(classification_report(y_test, predtest_y, target_names=target_names))
cnf = confusion_matrix(y_test, predtest_y)
print(cnf)
table_test = pd.crosstab(y_test, predtest_y, margins=True)
test_tI_errors = table_test.loc[0.0,1.0]/table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0]/table_test.loc['All','All']
print((
'Test set accuracy:\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}'
).format(test_tI_errors, test_tII_errors))
print(cross_val_score(lr,X2,y,cv=kf))
print(cross_val_score(lr,X2,y,cv=kf).mean())
print('Cross validation PCA:', cross_val_score(lr,XPCA,y,cv=kf).mean())
Random Forest
In [45]:
# Train model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.fit(X_test, y_test)
# Predict on training set
predtrainrf_y = rf.predict(X_train)
predtestrf_y = rf.predict(X_test)
In [46]:
#Training Scores
target_names = ['0', '1']
print(classification_report(y_train, predtrainrf_y, target_names=target_names))
cnf = confusion_matrix(y_train, predtrainrf_y)
print(cnf)
# Accuracy tables.
table_train = pd.crosstab(y_train, predtrainrf_y, margins=True)
train_tI_errors = table_train.loc[0.0,1.0] / table_train.loc['All','All']
train_tII_errors = table_train.loc[1.0,0.0] / table_train.loc['All','All']
print((
'Training set accuracy:\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}\n\n'
).format(train_tI_errors, train_tII_errors))
In [47]:
#Test Scores
target_names = ['0', '1']
print(classification_report(y_test, predtestrf_y, target_names=target_names))
cnf = confusion_matrix(y_test, predtestrf_y)
print(cnf)
table_test = pd.crosstab(y_test, predtestrf_y, margins=True)
test_tI_errors = table_test.loc[0.0,1.0]/table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0]/table_test.loc['All','All']
print((
'Test set accuracy:\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}'
).format(test_tI_errors, test_tII_errors))
print(cross_val_score(rf,X2,y,cv=kf))
print(cross_val_score(rf,X2,y,cv=kf).mean())
print('Cross validation PCA:', cross_val_score(rf,XPCA,y,cv=kf).mean())
Naive - Bayes
In [48]:
# Train model
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
bnb.fit(X_test, y_test)
# Predict on training set
predtrainbnb_y = bnb.predict(X_train)
predtestbnb_y = bnb.predict(X_test)
In [49]:
#Training Scores
target_names = ['0', '1']
print(classification_report(y_train, predtrainbnb_y, target_names=target_names))
cnf = confusion_matrix(y_train, predtrainbnb_y)
print(cnf)
# Accuracy tables.
table_train = pd.crosstab(y_train, predtrainbnb_y, margins=True)
train_tI_errors = table_train.loc[0.0,1.0] / table_train.loc['All','All']
train_tII_errors = table_train.loc[1.0,0.0] / table_train.loc['All','All']
print((
'Training set accuracy:\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}\n\n'
).format(train_tI_errors, train_tII_errors))
In [50]:
#Test Scores
target_names = ['0', '1']
print(classification_report(y_test, predtestbnb_y, target_names=target_names))
cnf = confusion_matrix(y_test, predtestbnb_y)
print(cnf)
table_test = pd.crosstab(y_test, predtestbnb_y, margins=True)
test_tI_errors = table_test.loc[0.0,1.0]/table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0]/table_test.loc['All','All']
print((
'Test set accuracy:\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}'
).format(test_tI_errors, test_tII_errors))
print(cross_val_score(bnb,X2,y,cv=kf))
print(cross_val_score(bnb,X2,y,cv=kf).mean())
print('Cross validation PCA:', cross_val_score(bnb,XPCA,y,cv=kf).mean())
Gradient Boosting Classifier
In [51]:
# We'll make 500 iterations, use 2-deep trees, and set our loss function.
params = {'n_estimators': 500,
'max_depth': 2,
'loss': 'deviance'}
# Initialize and fit the model.
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)
clf.fit(X_test, y_test)
# Predict on training set
predtrainclf_y = clf.predict(X_train)
predtestclf_y = clf.predict(X_test)
In [52]:
#Training Scores
target_names = ['0.0', '1.0']
print(classification_report(y_train, predtrainclf_y, target_names=target_names))
cnf = confusion_matrix(y_train, predtrainclf_y)
print(cnf)
# Accuracy tables.
table_train = pd.crosstab(y_train, predtrainclf_y, margins=True)
train_tI_errors = table_train.loc[0.0,1.0] / table_train.loc['All','All']
train_tII_errors = table_train.loc[1.0,0.0] / table_train.loc['All','All']
print((
'Training set accuracy:\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}\n\n'
).format(train_tI_errors, train_tII_errors))
In [53]:
#Test Scores
target_names = ['0.0', '1.0']
print(classification_report(y_test, predtestclf_y, target_names=target_names))
cnf = confusion_matrix(y_test, predtestclf_y)
print(cnf)
table_test = pd.crosstab(y_test, predtestclf_y, margins=True)
test_tI_errors = table_test.loc[0.0,1.0]/table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0]/table_test.loc['All','All']
print((
'Test set accuracy:\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}'
).format(test_tI_errors, test_tII_errors))
print(cross_val_score(clf,X2,y,cv=kf))
print(cross_val_score(clf,X2,y,cv=kf).mean())
print('Cross validation PCA:', cross_val_score(clf,XPCA,y,cv=kf).mean())
LDA Classifier
In [54]:
# LDA Classification
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
lda.fit(X_test, y_test)
# Predict on training set
predtrainlda_y = lda.predict(X_train)
predtestlda_y = lda.predict(X_test)
In [55]:
#Training Scores
target_names = ['0.0', '1.0']
print(classification_report(y_train, predtrainlda_y, target_names=target_names))
cnf = confusion_matrix(y_train, predtrainlda_y)
print(cnf)
# Accuracy tables.
table_train = pd.crosstab(y_train, predtrainlda_y, margins=True)
train_tI_errors = table_train.loc[0.0,1.0] / table_train.loc['All','All']
train_tII_errors = table_train.loc[1.0,0.0] / table_train.loc['All','All']
print((
'Training set accuracy:\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}\n\n'
).format(train_tI_errors, train_tII_errors))
In [56]:
#Test Scores
target_names = ['0.0', '1.0']
print(classification_report(y_test, predtestlda_y, target_names=target_names))
cnf = confusion_matrix(y_test, predtestlda_y)
print(cnf)
table_test = pd.crosstab(y_test, predtestlda_y, margins=True)
test_tI_errors = table_test.loc[0.0,1.0]/table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0]/table_test.loc['All','All']
print((
'Test set accuracy:\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}'
).format(test_tI_errors, test_tII_errors))
print(cross_val_score(lda,X2,y,cv=kf))
print(cross_val_score(lda,X2,y,cv=kf).mean())
print('Cross validation PCA:', cross_val_score(lda,XPCA,y,cv=kf).mean())
Before running the model, the data was cleaned. The NaN and values values were replaced by zeros. To run the model and predict the arrival delay, the first thing that was done was to select the most meaningful features. There was an overfitting problem when DepDelayed was chosen as one of the variables explaining the arrivals delayed. Hence, the following features were selected:
'NASDelay' 'LateAircraftDelay' 'CarrierDelay' 'TaxiOut'
The modesl still present some overfitting as the PCA scores are lower than the scores of the models through crossvalidation but are closer to each other.
To reduce the bias in the classifiers, as the delays and on-time arrivals were unevenly spread in the dataset, the dataset was resampled. In this case and due to the computing power requirements, the majority (on-time arrivals) was downsampled.
The resampled dataset was split into a train and a test dataset at 30% test dataset ratio.
The new dataset was run through different models: Naive Bayes, LDA, RandomForest, Logistic Regression and Gradient Boosting Classifier. Every model was run on a PCA predictors to see the max accuracy that the model was able to obtain through cross validation.
The model scoring a better result in the test sample was Random Forest scorring 97% and having a reduced % of type I and type II errors in the model. The model scoring the worst result was LDA.
The SVC model was not run (although it was tried) due to computational restrictions. More than 2 hours were required to run it for the first time and the model din´t run through all the data.